IRT SINA ASHRAFI
ITEM RESPONSE THEORY

Since the 1960s there has been growing interest in item response theory (IRT), a term that covers a range of models used to score tests. All of these models assume that a test is unidimensional, as described above. They further assume that an observed score indicates a person’s ability on an underlying construct, often referred to as a latent trait, which is taken to be a continuous, unobservable variable. Using item response theory, test designers can create a scale, usually running from around -4 to +4, on which all items or tasks can be placed; an item's value on the scale is its difficulty. Test takers who take the items can be placed on the same scale, and their scores are interpreted as person ability.

As such, there is a direct connection between ability and difficulty. In models without a guessing parameter, such as the Rasch model, when student ability equals item difficulty the probability of a test taker getting the item correct is 0.5. On the scale, therefore, easier items have lower values, and the probability of more able students passing them is very high. As the difficulty value of an item rises, a test taker must be more able to have a chance of getting the item correct.

For our ten items, the difficulty estimates produced by an analysis using a one-parameter Rasch model are shown in Table A7.4. The first thing to notice is that each estimate of difficulty has its own standard error, something that is not possible in classical test theory. This additional information is also provided with each estimate of person ability. The problem in classical test theory is that a single standard error of measurement is reported, which is most accurate at the test mean; as scores diverge from the mean, we know that measurement error increases. Consider Table A7.5, which gives ability estimates for our twenty test takers.
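The connection between ability and difficulty described above can be illustrated with a minimal sketch of the standard logistic form of the one-parameter Rasch model. The function name and the ability and difficulty values are hypothetical, chosen only for illustration; they are not taken from Tables A7.4 or A7.5.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the one-parameter (Rasch) model.

    Both arguments are on the same logit scale (roughly -4 to +4).
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Illustrative values only (not from the source tables).
# Ability equal to difficulty gives a probability of exactly 0.5.
p_equal = rasch_probability(0.0, 0.0)

# An able test taker (+2) facing an easy item (-2): success is very likely.
p_easy = rasch_probability(2.0, -2.0)

# The same test taker facing a harder item (+3): success is less likely.
p_hard = rasch_probability(2.0, 3.0)

print(p_equal, p_easy, p_hard)
```

Only the difference between ability and difficulty enters the formula, which is why persons and items can share a single scale.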
In Table A7.5, you will notice that as the score moves further from the mean (which is zero), the standard error of measurement increases. Where a person is labelled as a ‘misfit’, this is not pejorative. IRT applies a probabilistic model to the actual data; if the model cannot account for the data, a person or item is flagged as misfitting. What this means is that ‘an instance of person misfit can usually be attributed to anomalous test-taking behaviour of some kind’ (Baker, 1997: 41). Such ‘anomalous behaviour’ may include cheating. In our case, two persons had the maximum score of 10 and therefore could not be modelled, and one person got an item wrong that a person of his ability would have been expected to get correct.

Apart from the advantage of providing multiple error terms, IRT is also sample-free. That is, the item difficulty estimates are not dependent upon the sample used to generate them. As long as the sample was drawn from the population of interest, the estimates should be independent of the sample used (Crocker and Algina, 1986: 363). This also means that it is not necessary for every test taker to take every item in a pool in order to ensure that the item statistics are meaningful. Because of these properties, IRT has become the scoring method of choice for computer-based and computer-adaptive tests, where new items are selected according to difficulty on the basis of the current estimate of test taker ability (see Fulcher, 2000b).
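As a rough sketch of two points made above, the code below computes a per-person standard error from Rasch item information and selects the next item for an adaptive test. It assumes the usual Rasch result that an item contributes information p(1 - p) and that the standard error of an ability estimate is the reciprocal square root of the summed information. The item pool, its difficulty values, and all function names are hypothetical, and an operational adaptive test would add ability re-estimation and a stopping rule.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_information(ability, difficulty):
    """Information an item provides at a given ability: p * (1 - p)."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)


def ability_standard_error(ability, difficulties):
    """Standard error of an ability estimate: 1 / sqrt(total information)."""
    total = sum(item_information(ability, b) for b in difficulties)
    return 1.0 / math.sqrt(total)


def select_next_item(ability_estimate, pool, administered):
    """Pick the unused item that is most informative at the current estimate.

    Under the Rasch model this is the item whose difficulty is closest
    to the current ability estimate.
    """
    candidates = [name for name in pool if name not in administered]
    return max(candidates,
               key=lambda name: item_information(ability_estimate, pool[name]))


# Hypothetical item pool (difficulties in logits), for illustration only.
pool = {"item01": -2.0, "item02": -1.0, "item03": 0.0,
        "item04": 1.0, "item05": 2.0}

# The error is smallest near the centre of the items and grows further away.
se_at_mean = ability_standard_error(0.0, pool.values())
se_far = ability_standard_error(3.0, pool.values())

# With a current estimate of 0.8 and item03 already used, item04 is chosen.
next_item = select_next_item(0.8, pool, administered={"item03"})

print(se_at_mean, se_far, next_item)
```

Because the selection depends only on the calibrated difficulties and the current ability estimate, different test takers can be given different items from the pool and still be placed on the same scale.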